System for graph-based clustering of documents

ABSTRACT

System for graph-based clustering of documents. The system comprises one or more processors configured to receive a digital copy of a document to convert the document into a graph object. Further, the processor is configured to identify and label entities in the document, wherein each of the entities is represented as a node of the graph object. Further, the processor is configured to create the graph object for the received digital copy of the document and generate a graph embedding vector using a graph embedding neural network trained to receive the graph object as input and generate the graph embedding vector for the graph object as output. Finally, the processor is configured to cluster the graph embedding vector to a cluster comprising similar looking templates of the document.

BACKGROUND

Field of Invention

The disclosed subject matter relates to the field of document template identification and classification. More particularly, but not exclusively, the subject matter relates to the field of processing digital documents to be fed as input to a machine learning model.

Discussion of Prior Art

The rise of the internet and computerized systems has pushed the world into a digital era in which conventional manual processes are automated using digital systems, thereby improving the accuracy and efficiency of the processes. One field that has seen this transformation is the handling of documents, wherein computerized systems are used to digitally process the documents and classify them accurately. These systems must be equipped with the capacity to process different types of documents having different templates.

As an example, documents such as application forms and financial transaction bills vary from one document to another, and each document carries specific information related to the document itself. One piece of specific information in the document may be the company logo or name. The information may be unique to the document or shared with other documents. The specific information can therefore be grouped into unique and non-unique information. The unique information may be a transaction number or the date and time when the document was generated. Some of the non-unique or similar information, such as ‘date’, ‘time’, and ‘price’, may be explicitly mentioned in the document. An example of this kind of document is a print-out receipt for items purchased in a shop. Similarly, the common information across an application form carries general terms, namely name, age, sex, and address. The text present in the form specifies the spatial location or place where the relevant information has to be filled in, in place of the blank space in the document. The location of the text would have been decided by the document creator to collect or record the information from the document users. The location of the text varies from document to document. The variation of text may be in the form of unique information or non-unique information. If the non-unique textual information is the same and there is no variation in its location, then the documents can be grouped together, and the grouped documents can be assumed to belong to a particular kind of structure, which is referred to as a template. When the location of the non-unique information varies between the documents, the documents can be assumed to belong to different templates. The non-unique information has to be the same in terms of text to form a template.

To train a language model, a machine learning model, or a machine to understand all the possible complex variations for a particular template, it is necessary to provide training samples from that particular template that cover most of the variations. One of the conventional ways to select documents belonging to a particular template from a group of documents is to cluster the documents into groups based on templates. This is done by selecting a specific document and then comparing it with others in the group. This clustering operation helps improve the performance of a machine learning model by training the model on a particular template with most of the variations from the training data set.

Typically, a couple of image-based processing algorithms are used to group the documents; these depend on image-based features to cluster the documents. The image-based features are extracted from the region of interest in the document and used as a reference for grouping similar documents.

A similar approach has been extended to grouping documents based on word features using word vectors. The words in a document may include unique text and non-unique text. A problem arises when the digital document is a scanned copy of a document and not a digitally born document. There may be errors while performing OCR operations. The errors are erratic and may appear at random places. Therefore, there is a need to reduce the dependence on the non-unique and common texts.

Therefore, there is a need for a clustering approach that considers the structural information of the document for grouping similar documents to overcome the drawbacks of the conventional systems.

The documents used in the automation process may be of different types, like application forms, request forms, financial statements, authorization forms, and many more. The scanned copies of the original document, digitally generated documents, or camera-captured documents are the documents that are part of the automation process. A single type of document comes with several variations or changes. Initially, a document may contain general information that does not reflect a specific condition. Consider, as an example, an address filled in the form without explicitly mentioning the city or the nearest city. If the document creator wants to collect the city information, then it will be covered in the document. So, a single type of document gets new additions to the list of variations in the document. The number of variations in the documents keeps growing. There is no end to the possible additions, like an amendment to a byelaw. A single type has a lot of complexities that may be difficult for a model to understand. When different types of documents are combined, the complexity of the model training increases manyfold. Here, we present a typical example of the types of documents.

Typically, to train a machine learning model, a set of documents is used as samples. One of the most prominent challenges regarding the training samples is variations within a document type and across various document types. The documents within the set of documents may differ and cause a lot of ambiguity and confusion for the machine learning model. For example, the model may assume that the date comes after the identification number in a few documents, and the date may not appear after the identification number in the others. The model predicts the date with a probability score. Whenever the score is higher, any number coming after the identification number is considered a date by the model. This error produced by the model could be avoided by performing a filtering operation on all the documents. The documents received for training should be processed before passing them into the model. This reduces the amount of complexity involved in the training process.

In a conventional system, during training of a machine learning model, the clustered training data is used as an input. All the samples from each of the clusters are fed as input to the model to identify a document belonging to the cluster. Inputting all the samples even though the model has started to predict the document correctly is an unnecessary process and consumes enormous computational time.

Hence, there is a need for a system to select appropriate samples and an appropriate number of samples to be fed as input to the machine learning model so that the drawbacks of the existing systems are overcome.

SUMMARY

In an embodiment, a system for graph-based clustering of documents is disclosed. The system comprises one or more processors configured to receive a digital copy of a document to convert the document into a graph object. Further, the processor is configured to identify and label entities in the document, wherein each of the entities is represented as a node of the graph object. Further, the processor is configured to create the graph object for the received digital copy of the document and generate a graph embedding vector using a graph embedding neural network trained to receive the graph object as input and generate the graph embedding vector for the graph object as output. Finally, the processor is configured to cluster the graph embedding vector to a cluster comprising similar looking templates of the document.

In an embodiment, a system for optimizing a training dataset comprising sample documents is disclosed. The system comprises one or more processors configured to create a graph embedding vector for each of the sample documents of the training dataset and cluster the graph embedding vectors of the sample documents of the training dataset into clusters based on the similarity between the graph embedding vectors. Further, the processor is configured to select a first set of training data, using an optimization model, wherein the first set of training data comprises a finite number of graph embedding vectors of the sample documents from the clustered training dataset. Finally, the first set of training data is fed as input to a machine learning model.

BRIEF DESCRIPTION OF DRAWINGS

Embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:

FIG. 1 illustrates a block diagram of a system 100 for processing the digital copies of documents.

FIG. 2 is a flowchart 200 of a method of processing the digital copies of documents.

FIG. 3 is a flowchart 300 of a method of graph-based clustering approach for grouping documents.

FIGS. 4A-4C illustrate the generation of graph object 406 for a document 400 belonging to a first template.

FIGS. 5A-5C illustrate the generation of graph object 506 for a document 500 belonging to a second template.

FIG. 6 illustrates an architecture 600 of a graph embedding neural network 112, in accordance with an embodiment.

FIG. 7 illustrates the flowchart 700 of training of the graph embedding neural network 112, in accordance with an embodiment.

FIG. 8 is a flowchart 800 of optimizing training dataset, in accordance with an embodiment.

FIG. 9 illustrates clustering of samples into clusters based on similarity.

FIG. 10 is a flowchart 1000 of the process of selecting a first set of data from the training dataset.

FIG. 11 illustrates the selection of samples from the cluster, in accordance with an embodiment.

DETAILED DESCRIPTION

The following detailed description includes references to the accompanying drawings, which form a part of the detailed description. The drawings show illustrations in accordance with example embodiments. These example embodiments, which may be herein also referred to as “examples” are described in enough detail to enable those skilled in the art to practice the present subject matter. However, it may be apparent to one with ordinary skill in the art, that the present invention may be practised without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to unnecessarily obscure aspects of the embodiments. The embodiments can be combined, other embodiments can be utilized, or structural, logical, and design changes can be made without departing from the scope of the claims. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope is defined by the appended claims and their equivalents.

In this document, the terms “a” or “an” are used, as is common in patent documents, to include one or more than one. In this document, the term “or” is used to refer to a nonexclusive “or,” such that “A or B” includes “A but not B,” “B but not A,” and “A and B,” unless otherwise indicated.

Reference in the specification to “one embodiment” or to “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment of the invention, and multiple references to “one embodiment” or “an embodiment” should not be understood as necessarily all referring to the same embodiment.

FIG. 1 illustrates a block diagram of a system 100 for processing the digital copies of documents. The system 100 may comprise one or more processors 102, a scanning module 104, input modules 106, output modules 108, a memory module 110, a graph embedding neural network 112 and an optimization module 114.

In an embodiment, the processor 102 may be implemented in the form of one or more processors 102 and may be implemented as appropriate in hardware, computer-executable instructions, firmware, or combinations thereof. Computer-executable instruction or firmware implementations of the processor 102 may include computer-executable or machine-executable instructions written in any suitable programming language to perform the various functions described.

In an embodiment, the scanning module 104 may be configured to scan a document and further convert it into a computer-readable format.

In an embodiment, the input modules 106 may provide an interface for input devices such as a keypad, touch screen, mouse and stylus, among other input devices. The input modules 106 may include a camera or scanner.

In an embodiment, the output modules 108 may provide an interface for output devices such as a display screen, speakers, printer and haptic feedback devices, among other output devices.

In an embodiment, the memory module 110 may include a permanent memory such as a hard disk drive, and may be configured to store data and executable program instructions that are implemented by the processor 102. The memory module 110 may be implemented in the form of a primary and a secondary memory. The memory module 110 may store additional data and program instructions that are loadable and executable on the processor 102, as well as data generated during the execution of these programs. Further, the memory module 110 may be volatile memory, such as random-access memory and/or a disk drive, or non-volatile memory. The memory module 110 may comprise removable memory such as a Compact Flash card, Memory Stick, Smart Media, Multimedia Card, Secure Digital memory, or any other memory storage that exists currently or may exist in the future.

In an embodiment, the graph embedding neural network 112 may be configured to generate a graph embedding of a graph object corresponding to a document.

In an embodiment, the optimization module 114 may be configured to optimize the clustered documents.

FIG. 2 is a flowchart of a method of processing the digital copies of documents. At step 202, the digital copies of the input documents to be fed as input to the machine learning model are received by the processor 102.

At step 204, the received documents may be clustered by the processor 102 using a graph-based clustering process. The graph-based clustering process will be explained in greater detail later. The outcome of the graph-based clustering process is that the documents are clustered into clusters, wherein each cluster includes documents belonging to a similar template.

At step 206, the processor 102 may be configured to optimize the clustered documents using the optimization module 114. The optimization of the documents will be explained in greater detail later.

At step 208, the optimized documents are fed as an input to the machine learning model for classifying the documents or named entities within the documents.

Graph-Based Clustering Approach for Grouping Documents

FIG. 3 is a flowchart of a method of graph-based clustering approach for grouping documents. At step 302, the processor 102 may be configured to receive a digital copy of a document that is to be converted into a graph object.

At step 304, the processor 102 may be configured to identify and label entities in the document, wherein each of the entities in the document is represented as a node of the graph object.

In one embodiment, the labelling of the entities in the document may be manually performed by a human user.

In one embodiment, the labelling of the entities in the document may be performed by a supervised learning-based machine learning model. A Generic Named Entity Recognition (GNER) engine is configured to label the documents automatically without manual intervention. The engine may be trained to label different types of general entities such as a person, location, organization, amount, date, and time. The use of the GNER engine speeds up the overall process of grouping documents which are complex in nature. The entity information from the engine is used to create the graph nodes representing the graph object of a document. The GNER engine speeds up the process of labelling entities either by manual annotation or by training a supervised machine learning model.

At step 306, the processor 102 may be configured to create the graph object for the received digital copy of the document. The graph object for the document may be created by connecting each of the nodes representing an entity with its neighbouring nodes along four directions. The four directions may be top, bottom, left, and right.

In one embodiment, the cartesian coordinate system is used to relate the nodes through the values x, y, w, and h, wherein x stands for the position of the node from the left of the document, y stands for the position of the node from the top of the document, w stands for the width of the node, and h stands for the height of the node.

Upon connecting each of the nodes with its neighbouring nodes, edges may be formed between each of the nodes and its neighbouring nodes along four directions.

The edges may be formed between the nodes based on the relative position of each of the nodes with its neighbouring nodes.
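The construction described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the `Node` class, the `direction` and `build_edges` helpers, and the rule that a neighbour is the nearest node (by centre distance) in each of the four directions are assumptions made only for illustration, since the description does not specify how a neighbour is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A labelled entity with its bounding box (x, y from the top-left corner)."""
    label: str
    x: float
    y: float
    w: float
    h: float

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def direction(a, b):
    """Dominant direction of node b relative to node a (y grows downward)."""
    dx = b.center[0] - a.center[0]
    dy = b.center[1] - a.center[1]
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "bottom" if dy > 0 else "top"


def build_edges(nodes):
    """Connect every node to its nearest neighbour in each of the four directions.

    Assumption: "nearest" means smallest squared distance between box centres.
    Returns a set of (source_index, target_index, direction) triples.
    """
    edges = set()
    for i, a in enumerate(nodes):
        nearest = {}  # direction -> (squared distance, node index)
        for j, b in enumerate(nodes):
            if i == j:
                continue
            d = direction(a, b)
            dist = (b.center[0] - a.center[0]) ** 2 + (b.center[1] - a.center[1]) ** 2
            if d not in nearest or dist < nearest[d][0]:
                nearest[d] = (dist, j)
        for d, (_, j) in nearest.items():
            edges.add((i, j, d))
    return edges


# Hypothetical three-entity document: two fields on one row, one field below.
nodes = [
    Node("person", 0, 0, 10, 5),
    Node("date", 50, 0, 10, 5),
    Node("amount", 0, 40, 10, 5),
]
edges = build_edges(nodes)
```

Each edge carries its direction, so the relative position of the two entities is preserved in the graph object rather than only the fact that they are connected.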

In another embodiment, the difference between two graph objects corresponding to two documents may be measured. In order to do that, the graph edit distance (GED) between the graph objects may be computed. GED is a measure of the nodes which are deleted, inserted, or substituted. Whenever a delete, insertion, or substitution operation is performed, a certain amount of cost is added to the measurement in terms of graph edit distance. The GED computation is an NP-hard problem because nodes that do not carry any relevant information between the documents which are supposed to be part of the nodes are used. Using the entity type information which is present in the node, a node from one document is compared against the nodes in the other document. The semantic information of entity type as part of the node information may reduce the comparison from all the nodes in the other document to a set of limited nodes which match the entity type. The cost between all the nodes is computed and stored in a two-dimensional matrix form, and then a least-square fit algorithm is applied using the delete, insertion, or substitution operation to compute the lowest GED between the two documents.

Typically, the computation of GED cost between two documents is time consuming. For example, if there are 10 documents to cluster, then the GED cost between every pair of the 10 documents must be computed. Suppose the time consumed to compute one GED cost is ‘x’ units. The time consumed to compute the pairwise GED cost matrix for 10 documents is 10*9/2=45 times ‘x’ units. In general, n documents require n*(n-1)/2 GED computations, so the computation time of the GED cost matrix increases quadratically when the number of documents increases linearly.
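The arithmetic in the example above can be checked with a short Python sketch. The function name `pairwise_ged_calls` is hypothetical; it only counts how many document pairs need a GED computation and does not compute GED itself.

```python
from itertools import combinations


def pairwise_ged_calls(num_documents):
    """Number of unordered document pairs, i.e. n * (n - 1) / 2."""
    return num_documents * (num_documents - 1) // 2


# Cross-check the closed form against an explicit enumeration of pairs.
for n in (10, 100, 1000):
    assert pairwise_ged_calls(n) == sum(1 for _ in combinations(range(n), 2))

calls_for_10 = pairwise_ged_calls(10)    # 45, as in the example above
calls_for_100 = pairwise_ged_calls(100)  # 4950
```

Multiplying the number of documents by ten multiplies the number of GED computations by roughly one hundred, which is why a pairwise GED matrix becomes impractical for large document sets.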

At step 308, the processor 102 may be configured to generate a graph embedding vector for the graph object using a graph embedding neural network 112. The graph embedding neural network 112 may be trained to receive the graph object as input and generate the graph embedding vector for the graph object as output. The training of the graph embedding neural network 112 will be explained in greater detail later.

At step 310, the processor 102 may be configured to cluster the graph embedding vector to a cluster comprising similar looking templates of the document.

In one embodiment, a clustering approach such as partitional clustering (for example, K-means clustering), hierarchical clustering (for example, agglomerative clustering), or spectral clustering may be used to cluster the input documents.
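As one illustration of clustering embedding vectors, the following Python sketch groups vectors whose cosine similarity reaches a threshold, which behaves like single-linkage agglomerative clustering cut at that threshold. The function names, the use of cosine similarity for clustering, and the example threshold of 0.9 are assumptions for illustration; the description names the clustering families but does not prescribe this particular procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def threshold_clusters(vectors, min_similarity):
    """Merge vectors into one cluster whenever a pair is at least min_similarity alike.

    Uses a union-find structure; returns one integer cluster label per vector,
    numbered in order of first appearance.
    """
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= min_similarity:
                parent[find(j)] = find(i)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(vectors))]


# Toy 2-dimensional embeddings standing in for graph embedding vectors.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = threshold_clusters(embeddings, min_similarity=0.9)
```

Raising the threshold yields more, tighter clusters; lowering it merges templates that are only loosely alike.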

FIGS. 4A-4C illustrate the generation of graph object 406 for a document 400 belonging to a first template. Referring to FIG. 4A, the entities 402 are identified in the document by the processor 102. Referring to FIG. 4B, the entities are labelled 404 by the processor 102. Referring to FIG. 4C, each of the entities is represented as a node 404 and edges 408 are formed between the nodes to create a graph object 406 for the document belonging to the first template.

FIGS. 5A-5C illustrate the generation of graph object 506 for a document 500 belonging to a second template. As mentioned above, a graph object 506 is created for the document 500 belonging to the second template.

Similarly, graph objects may be generated for every document that is fed as input to the system.

FIG. 6 illustrates an architecture 600 of a graph embedding neural network 112, in accordance with an embodiment. The graph embedding neural network 112 while training may be a Siamese network. The graph embedding neural network 112 may comprise a first neural network 602 identical to a second neural network 604. The first neural network 602 may comprise a first encoder 606, a first graph neural network 608 and a first pooling layer 610. The second neural network 604 may comprise a second encoder 612, a second graph neural network 614 and a second pooling layer 616.

In one embodiment, the first encoder 606 and the second encoder 612 may each be a multi-layer perceptron (MLP) that takes a graph object as input and projects the entity type information of each node into a vector of predefined size, namely the entity type embedding. The output vectors from the encoder may then be used as node features for the first graph neural network 608 and the second graph neural network 614.

The graph neural network (608 and 614) may be used as a second stage in the encoding process to cover the local and location information which is missing in the first encoder and the second encoder. The local and location information of a node, such as its neighbours, is encoded in the second stage. During the information propagation phase, every node will obtain the information from all its neighbour nodes at the current timestamp and use that information to update its own internal representation. As the timestamp increases, the information of the nodes gets propagated to more nodes within the graph. The final output of the graph neural network (608 and 614) is a set of rich node representations that are learned to encode the structural information about the document.
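The propagation phase described above can be sketched in plain Python. This is only an illustration of the idea, not the network disclosed here: the update rule (an equal blend of a node's own vector with the mean of its neighbours' vectors) is an assumption that stands in for learned weights, and `propagate` is a hypothetical helper.

```python
def propagate(features, neighbours, steps):
    """Run a fixed number of propagation steps over a graph.

    features:   dict mapping node id -> list of floats (node feature vector)
    neighbours: dict mapping node id -> list of neighbouring node ids
    At each step every node reads its neighbours' vectors from the current
    step and blends their mean into its own vector (assumed 50/50 blend).
    """
    current = {node: list(vec) for node, vec in features.items()}
    for _ in range(steps):
        updated = {}
        for node, vec in current.items():
            messages = [current[m] for m in neighbours.get(node, [])]
            if not messages:
                updated[node] = list(vec)
                continue
            mean = [sum(column) / len(messages) for column in zip(*messages)]
            updated[node] = [0.5 * own + 0.5 * msg for own, msg in zip(vec, mean)]
        current = updated
    return current


# Chain graph a - b - c where only node "a" starts with a non-zero feature.
features = {"a": [1.0], "b": [0.0], "c": [0.0]}
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

after_one = propagate(features, neighbours, steps=1)
after_two = propagate(features, neighbours, steps=2)
```

After one step the information from node "a" has reached only its direct neighbour "b"; after two steps it has also reached "c", which mirrors how information spreads to more nodes as the timestamp increases.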

The last stage is the first pooling layer 610 and the second pooling layer 616. The pooling layer (610 and 616) may also be a multi-layer perceptron layer that learns to aggregate the node embeddings learned from the graph neural network module (608 and 614) and produce a vector of predefined size to represent the graph embedding of the input graph.

FIG. 7 illustrates the flowchart 700 of training of the graph embedding neural network 112, in accordance with an embodiment. The graph objects 702 for the training dataset are created, wherein the graph objects 702 include the entities that are labelled. Further, a GED matrix 704 may be computed for a first batch of documents from the input training dataset. The processor 102 may be configured to normalize the computed graph edit distances of the graph edit distance matrix to the range of 0 to 1.

A pair of graph objects for two documents and the computed GED matrix may be input to the graph embedding neural network 112. One of the pair of graph objects may be fed as input to the first neural network 602 and the other graph object may be fed as input to the second neural network 604. The graph embedding neural network 112 may generate a graph embedding vector (708 and 710) for each of the input graph objects.

In one embodiment, the generated graph embedding vector (708 and 710) may be a vector of the size 1×128.

A similarity score may be computed between the generated graph embedding vectors (708 and 710) corresponding to two graph objects. The similarity score may be calculated using a cosine similarity function.

The learning objective of the Siamese network is to generate a similarity score that is as close as possible to 1 minus the pre-computed normalized graph edit distance. By optimizing this learning objective with backpropagation, the Siamese network may learn to move similar graph embeddings closer to each other if their normalized graph edit distance is small, and vice versa.
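The learning objective can be made concrete with a small Python sketch. Two points are assumptions rather than details from the description: the GED matrix is normalized by dividing by its largest entry, and the gap between the similarity score and the target is measured with a squared error. The helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def normalize_ged(matrix):
    """Scale a GED matrix into the range 0 to 1 (assumed: divide by the maximum)."""
    largest = max(max(row) for row in matrix)
    if largest == 0:
        return [[0.0 for _ in row] for row in matrix]
    return [[value / largest for value in row] for row in matrix]


def pair_loss(embedding_a, embedding_b, normalized_ged):
    """Squared gap between the similarity score and the target 1 - normalized GED."""
    target = 1.0 - normalized_ged
    return (cosine_similarity(embedding_a, embedding_b) - target) ** 2


# Toy values: two documents whose raw GED is 4 (the largest in this batch).
ged = normalize_ged([[0, 4], [4, 0]])
loss_far = pair_loss([1.0, 0.0], [0.0, 1.0], ged[0][1])   # orthogonal embeddings
loss_same = pair_loss([1.0, 0.0], [1.0, 0.0], ged[0][0])  # identical embeddings
```

In both toy cases the loss is zero because the embeddings already agree with the target: a pair at maximum normalized GED has similarity 0, and a pair at zero GED has similarity 1.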

The computation of GED between nodes is replaced by the graph embedding neural network 112. The input to the graph embedding neural network 112 is a graph object obtained after labelling the entities. The output of this network is a graph embedding vector of 128 dimensions. The 128-dimensional vectors are used to form clusters of documents. A threshold is used to limit the number of clusters based on the selected clustering method.

The NP-hard problem of computing the GED is replaced with a trained neural network. The advantages of this method are a reduction in time, a quick turnaround, flexibility, and more control in selecting a threshold value for the clustering process. The concept of the GED is replaced by a similarity score between the vectors that can be computed quickly.

Optimizing Training Dataset Comprising Sample Documents

The system 100 may be configured to optimize the training dataset that is clustered into clusters based on the template.

FIG. 8 is a flowchart 800 of optimizing training dataset, in accordance with an embodiment. At step 802, a graph embedding vector for each of the documents of the training dataset may be created using the processor 102.

At step 804, the processor 102 may cluster the graph embedding vectors into clusters based on the similarity between the graph embedding vectors. The clustering of the input documents may be performed using the aforesaid graph-based clustering approach.

In one embodiment, a clustering approach such as partitional clustering (for example, K-means clustering), hierarchical clustering (for example, agglomerative clustering), or spectral clustering may be used to cluster the input documents.

Referring to FIG. 9, the input samples 902 belonging to different templates (represented by different symbols) are segregated into different clusters 906 based on the similarity between the samples.

At step 806, an optimization module 114 may be configured to select a first set of training data from the clustered training dataset. The first set of training data may be a finite number of graph embedding vectors of the sample documents from the clustered training dataset.

At step 808, the selected first set of training data may be fed as input to a machine learning model. Therefore, only a part of the training dataset is input to the machine learning model as training data, thereby reducing the training time and operational costs.

FIG. 10 is a flowchart 1000 of the process of selecting a first set of data from the training dataset. At step 1002, the clustered training dataset may be fed as an input to the optimization module 114. The input clustered training dataset may be graph embedding vectors of the samples of the training dataset.

At step 1004, the processor 102 may select a cluster among multiple clusters of the training dataset.

At step 1006, the processor 102 may determine the size of the selected cluster, wherein the size of the cluster represents the number of graph embedding vectors of the sample documents in the cluster.

At step 1008, the processor 102 may determine whether the size of the cluster is less than a predefined lower threshold value.

If the size of the cluster is less than the predefined lower threshold value, at step 1010, the cluster may be ignored by the selection process and the samples in the cluster are retained as input training data for the machine learning model.

In one embodiment, the lower threshold value may be 100. In other words, if the number of samples in the cluster is less than 100, then the cluster may be ignored.

Further, at step 1014, the processor 102 may determine whether all the clusters are covered. If not, the processor 102 may select another cluster at step 1002. If all the clusters are covered, then the processor 102, at step 1016, may combine the positive and negative samples from the clusters to obtain the first set of data.

If the size of the cluster is not less than a predefined lower threshold value, at step 1012, the processor 102 may select positive samples and negative samples from the cluster for the first set of data.

In an embodiment, the number of samples in the first set of data is based on a predefined upper threshold value.

FIG. 11 illustrates the selection of samples from the cluster, in accordance with an embodiment. The cluster 1102 comprises a group of documents clustered based on the similarity using a clustering technique.

The cluster 1102 comprises a boundary 1108 that separates the interior samples and the outliers. The samples within the boundary 1108 may be positive samples 1110 and the samples that are outside the boundary 1108 may be negative samples 1112.

At step 1106, a certain number of positive samples 1110 (samples within the boundary) may be selected by the processor 102 for the first set of training data. Similarly, a certain number of negative samples 1112 may be selected by the processor 102 for the first set of training data.

In an embodiment, the number of positive samples selected may be between the lower threshold value and the upper threshold value, if the size of the cluster is less than or equal to the upper threshold value.

In an embodiment, the number of positive samples selected may be equal to the upper threshold value, if the size of the cluster is greater than the upper threshold value.

In an embodiment, the number of negative samples selected may be equal to 10% of the number of positive samples selected.

In one embodiment, the upper threshold value may be 500.
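The threshold rules above can be summarized in a short Python sketch. It is a hypothetical illustration: `selection_counts` is not a name from this description, and for clusters whose size lies between the two thresholds the sketch assumes that the number of positive samples equals the cluster size, which is one value within the stated range.

```python
def selection_counts(cluster_size, lower=100, upper=500, negative_percent=10):
    """Return (positives, negatives) to select from a cluster, or None.

    None means the cluster is below the lower threshold and is not sampled
    by this step (the step 1010 branch).
    Assumption: for lower <= size <= upper, positives equals the cluster size.
    """
    if cluster_size < lower:
        return None
    positives = min(cluster_size, upper)
    negatives = positives * negative_percent // 100
    return positives, negatives


small = selection_counts(50)     # below the lower threshold
medium = selection_counts(300)   # between the thresholds
large = selection_counts(1200)   # above the upper threshold
```

With the example thresholds of 100 and 500, a cluster of 1200 samples contributes 500 positive and 50 negative samples, so the size of the first set of data stays bounded however large a cluster grows.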

In one embodiment, each of the graph embedding vectors of the sample documents in the cluster may include a threshold score. The positive samples may comprise graph embedding vectors with a threshold score in the range of 0.4-0.5 and the negative samples may comprise graph embedding vectors with a threshold score above 0.5.

In one embodiment, the processor 102 may be configured to select the positive samples that are within the boundary of the cluster.

In one embodiment, the number of graph embedding vectors in the first set of training data may be less than or equal to the number of graph embedding vectors of the sample documents of the training dataset.

It shall be noted that the processes described above are described as a sequence of steps; this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, or some steps may be performed simultaneously.

Although embodiments have been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader scope of the system 100 and method described herein. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Many alterations and modifications of the present invention will no doubt become apparent to a person of ordinary skill in the art after having read the foregoing description. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. It is to be understood that the description above contains many specifications; these should not be construed as limiting the scope of the invention but as merely providing illustrations of some of the personally preferred embodiments of this invention. Thus, the scope of the invention should be determined by the appended claims and their legal equivalents rather than by the examples given.

What is claimed is:
 1. A system for graph-based clustering of documents, the system comprises one or more processors configured to: receive a digital copy of a document to convert the document into a graph object; identify and label entities in the document, wherein each of the entities is represented as a node of the graph object; create the graph object for the received digital copy of the document; generate a graph embedding vector using a graph embedding neural network trained to receive the graph object as input and generate the graph embedding vector for the graph object as output; and cluster the graph embedding vector to a cluster comprising similar looking templates of the document.
 2. The system as claimed in claim 1, wherein the one or more processors are configured to create the graph object by: connecting each of the nodes representing an entity with its neighbouring nodes along four directions; and forming edges between each of the nodes and its neighbouring nodes along four directions.
 3. The system as claimed in claim 2, wherein the edges are formed between the nodes based on the relative position of each of the nodes with its neighbouring nodes.
 4. The system as claimed in claim 1, wherein the graph embedding neural network is a Siamese network comprising: a first neural network comprising: a first encoder; a first graph neural network; and a first pooling layer; and a second neural network comprising: a second encoder; a second graph neural network; and a second pooling layer.
 5. The system as claimed in claim 4, wherein the one or more processors are configured to: train the graph embedding neural network using a training dataset comprising training documents, wherein the graph embedding neural network is trained by: identifying and labelling entities, using the processor, in each of the training documents, wherein each of the entities is represented as a node of a graph object; creating graph objects, using the processor, for each of the training documents; computing a graph edit distance (GED) matrix, using the processor, for a first batch of documents from the training dataset; inputting a pair of graph objects and the computed graph edit distance matrix to the graph embedding neural network, wherein one of the graph objects is input to the first neural network and the other graph object is input to the second neural network; generating a graph embedding vector, by the graph embedding neural network, for each of the input pair of graph objects; and calculating a similarity score between the graph embedding vectors generated by the first neural network and the second neural network.
 6. The system as claimed in claim 5, wherein: the first encoder and the second encoder are configured to receive the graph object as input and generate an entity type embedding vector of a predefined size; the first graph neural network and the second graph neural network are configured to generate node representations encoding the structural information of the documents of the graph objects; and the first pooling layer and the second pooling layer are configured to aggregate the node representations and generate the graph embedding vector for the input graph object.
 7. The system as claimed in claim 5, wherein the one or more processors are configured to normalize the computed graph edit distance of the graph edit distance matrix to be in the range of 0 to 1.
 8. The system as claimed in claim 5, wherein the graph embedding vector is a vector of size 1×128.
 9. The system as claimed in claim 5, wherein upon training the graph embedding neural network, only one neural network from the Siamese network is configured to generate the graph embedding vector for the input graph object.
 10. The system as claimed in claim 5, wherein: the similarity score is calculated using a cosine similarity function; and the similarity score is represented as (1-GED).
 11. The system as claimed in claim 1, wherein the one or more processors are configured to cluster the graph embedding vectors using clustering techniques such as partitional clustering such as K-means clustering, hierarchical clustering such as agglomerative clustering, or spectral clustering.
 12. The system as claimed in claim 1, wherein the system comprises a machine learning model configured to classify the documents, wherein the clustered documents are fed as input to the machine learning model.
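
By way of a non-limiting illustration, the normalization of the graph edit distance matrix to the range of 0 to 1 and the similarity score represented as (1-GED) may be sketched as follows in Python. The normalization scheme (division by the largest entry of the matrix), the function names, and the example values are assumptions made solely for this sketch; the claims do not specify how the normalization is performed.

```python
import math
from typing import List, Sequence


def normalize_ged_matrix(ged: List[List[float]]) -> List[List[float]]:
    """Scale every entry to [0, 1] by dividing by the largest entry.

    Max-scaling is an assumed scheme, not one stated in the source.
    """
    max_ged = max(max(row) for row in ged)
    if max_ged == 0:
        return [[0.0 for _ in row] for row in ged]
    return [[value / max_ged for value in row] for row in ged]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def target_similarity(normalized_ged: float) -> float:
    """Similarity score represented as (1 - GED) for a normalized GED."""
    return 1.0 - normalized_ged


# Made-up GED matrix for a batch of three documents.
ged_matrix = [
    [0.0, 2.0, 8.0],
    [2.0, 0.0, 6.0],
    [8.0, 6.0, 0.0],
]
normalized = normalize_ged_matrix(ged_matrix)

# Made-up 2-dimensional embeddings standing in for 1x128 vectors.
embedding_a = [1.0, 0.0]
embedding_b = [1.0, 1.0]
predicted = cosine_similarity(embedding_a, embedding_b)
target = target_similarity(normalized[0][1])
```

Under one reading of the claims, `predicted` is the cosine similarity between the two graph embedding vectors and `target` is the value (1-GED) against which it would be compared during training.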